Evaluating Web Content Quality via Multi-scale Features 
ABSTRACT

Web content quality measurement is crucial to various Web
content processing applications. This paper explores
multi-scale features that may affect the quality of a host,
and develops automatic statistical methods to evaluate
Web content quality. The extracted properties include
statistical content features, page and host level link features,
and TFIDF features. Experiments on the ECML/PKDD
2010 Discovery Challenge data set show that the algorithm
is effective and feasible for quality tasks in multiple
languages, and that the multi-scale features differ in their
identification ability and complement each other well
for most tasks.

Categories and Subject Descriptors 

H.5.4 [Information Interfaces and Presentation]: Hy- 
pertext/Hypermedia; K.4.m [Computer and Society]: Mis- 
cellaneous; H.4.m [Information Systems]: Miscellaneous 

General Terms 

Measurement, Experimentation, Algorithms 

Keywords 

Web Spam, Web Content Quality, Quality Assessment 

1. INTRODUCTION 

The evaluation of Web content quality plays an important
role in various Web content processing applications, such as
search engines, Web archiving services and Internet directories.
But how should the quality of Web content be evaluated?
In the past, most data quality measures were developed on
an ad hoc basis to solve specific problems, and the fundamental
principles necessary for developing stable metrics in practice
were insufficient [4]. Research on Web content quality
assessment should therefore focus on computational models
that can automatically predict Web content quality.

Web spam can significantly deteriorate the quality of search
engine results, but high quality is more than just the opposite
of Web spam. The ECML/PKDD 2010 Discovery Challenge
(DC2010) addresses more aspects of Web sites. DC2010
aims to develop site-level classification for the genre of
Web sites (editorial, news, commercial, educational, "deep
Web" or Web spam and more) as well as their readability,
authoritativeness, trustworthiness and neutrality [2].

Statistical learning methods have demonstrated their ef- 
fectiveness for many classification problems, such as Web 
spam detection, text categorization and anti-phishing [1] [3] 
[5] [6] [13] [19], which inspires us to evaluate Web content 
quality with statistical learning algorithms. In this paper, 
we will explore a series of features from multiple views, and 
compare their effectiveness for Web content quality assess- 
ment. 

The rest of this paper is organized as follows. Section 2
first introduces multi-scale feature extraction, focusing on
the TFIDF features and the host level link features, and then
discusses the feature fusion strategy. Section 3 gives the
Web content quality assessment method. Section 4 presents
our experimental results. Finally, Section 5 draws conclusions
and discusses future work on Web content quality assessment.

2. FEATURES EXTRACTION 

In this section, we describe multi-scale features extracted
from four different views: content statistics features, page
level link related features, host level link related features
and text features (TFIDF) [5]. We then give the feature
fusion strategy.

2.1 Content, Link and TFIDF Features 

The content features and page level link features used here 
are provided by the ECML/PKDD 2010 Discovery Chal- 
lenge organization committee, i.e. content-based features 
and link-based features [2]. 

We compute the TFIDF [5] features with the term frequencies
and document frequencies provided by DC2010 [2]:

$$a_{ik} = f_{ik} \times \log\left(\frac{N}{n_i}\right) \tag{1}$$

where $a_{ik}$ is the weight of word $i$ in document $k$, $f_{ik}$ the
frequency of word $i$ in document $k$, $N$ the number of documents
in the collection, and $n_i$ the total number of times word
$i$ occurs in the whole collection.
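
As a rough illustration of Eq. (1), the weight of a word is its frequency in the document multiplied by the logarithm of N divided by the word's collection count. The sketch below is only an assumption-laden example: the function names, the toy counts and the use of the natural logarithm are ours (the text does not state the log base), and it is not the DC2010 tooling.

```python
import math


def tfidf_weight(term_freq, num_docs, collection_count):
    """Weight of one word in one document: f_ik * log(N / n_i).

    The natural logarithm is an assumption; the paper does not give the base.
    """
    if term_freq == 0 or collection_count == 0:
        return 0.0
    return term_freq * math.log(num_docs / collection_count)


def tfidf_vector(doc_term_freqs, num_docs, collection_counts):
    """Map every word of a document to its TFIDF weight."""
    return {
        term: tfidf_weight(freq, num_docs, collection_counts.get(term, 0))
        for term, freq in doc_term_freqs.items()
    }


# Hypothetical toy data: a collection of 1000 documents.
doc = {"quality": 3, "web": 10}
counts = {"quality": 10, "web": 1000}
weights = tfidf_vector(doc, num_docs=1000, collection_counts=counts)
```

In this toy case the very common word "web" receives weight 0 because log(1000/1000) = 0, while the rarer word "quality" is weighted up.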

After computing $a_{ik}$, feature selection is performed. Feature
selection attempts to remove non-informative terms in
order to improve classification performance and reduce
computational complexity. In this paper, we use information
gain (IG) for feature selection. IG measures the
number of bits of information obtained for category prediction
by knowing the presence or absence of a word in
a document. Information gain has proved to be one
of the most effective feature selection methods for text
categorization [5], statistical spam filtering [9] and information
retrieval [16].
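
To make the information gain criterion concrete, the following minimal sketch computes IG for a single word from its presence or absence in each document, as the class entropy minus the entropy remaining after splitting on the word. The helper names and the four-document toy data are hypothetical, and the binary presence formulation is the standard textbook one, assumed here rather than taken from the paper's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(term_present, labels):
    """Bits gained about the class by knowing whether the word occurs."""
    with_term = [y for x, y in zip(term_present, labels) if x]
    without_term = [y for x, y in zip(term_present, labels) if not x]
    total = len(labels)
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy data: four documents, two classes.
labels = ["spam", "spam", "ok", "ok"]
perfect = information_gain([True, True, False, False], labels)
useless = information_gain([True, False, True, False], labels)
```

A word that separates the two classes perfectly yields one full bit, while a word spread evenly over both classes yields none; ranking words by this score and keeping the top ones is the selection step described above.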

2.2 Host Level Link Related Features 

PageRank [11] is one of the most famous link analysis
algorithms; it reflects the importance of Web pages. With
the growing prevalence of link spam, PageRank scores have
become unreliable as a quality measure. Based on the
hypotheses that benign nodes tend to link to other high
quality nodes and that malicious nodes are mainly linked to
by low quality nodes, we extract a series of host level link
analysis features and attempt to mine quality relations from
the topological dependencies.

Let $weight = f(n)$ be a weighting function, where $n$ is the
number of links between any host pair $(h, v)$ ($h \in V$, $v \in V$),
and let $E$ be the set of edges with $weight > W$. Then the host
graph $G$ can be defined as $G = (V, E, weight)$. Considering
the topological dependencies of low and high quality nodes,
the following features related to the host graph can be extracted:

$$F_1(h) = Measure(h) \tag{2}$$

$$F_2(h) = \frac{\sum_{v \in Inlink(h)} Measure(v) \cdot weight(h, v)}{\sum_{v \in Inlink(h)} weight(h, v)} \tag{3}$$

$$F_3(h) = \frac{\sum_{v \in Outlink(h)} Measure(v) \cdot weight(h, v)}{\sum_{v \in Outlink(h)} weight(h, v)} \tag{4}$$

where $Measure \in \{$HostRank (host level PageRank), DomainPR,
Truncated PageRank ($T = 1, 2, \ldots$), Adaptive Estimation
of Supporters ($d = 1, 2, \ldots$)$\}$ ^1 [10]. HostRank is
computed on the host graph $G$, and DomainPR is the
rank value of the domain that a host belongs to, which is queried
from http://toolbarqueries.google.com; for example, the
DomainPR of impressum.dukemaster.eu is the same as that of
dukemaster.eu. Here $\{h, v\} \subset V$, $Inlink(h)$ is the inlink set of $h$
and $Outlink(h)$ is the outlink set of $h$. $weight(h, v)$ is the
weight between hosts $h$ and $v$, with $weight(h, v) \in \{1, \log(n), n\}$, where
$n$ is the number of hyperlinks between $h$ and $v$.

In our experiments, $W = 1$, $T \in \{1, 2, 3, 4\}$, $d \in \{1, 2, 3, 4\}$
and $weight(h, v) \in \{1, n\}$. In total, we extract 50 host level
link features.
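
The three host level features can be read as the host's own score plus two weighted averages of a score over its inlink and outlink neighbours. The sketch below follows that reading; the dictionary-based graph, the toy scores and the function names are hypothetical, and the logarithm that the authors apply to the measure values in practice is left out for brevity.

```python
def host_link_features(host, measure, inlinks, outlinks, weight):
    """Return (F1, F2, F3) for one host.

    measure:  dict mapping host -> score (e.g. a HostRank-like value)
    inlinks:  dict mapping host -> list of hosts linking to it
    outlinks: dict mapping host -> list of hosts it links to
    weight:   function (h, v) -> non-negative edge weight
    """
    def weighted_mean(neighbors):
        total_weight = sum(weight(host, v) for v in neighbors)
        if total_weight == 0:
            return 0.0  # isolated host: no neighbours to average over
        weighted_sum = sum(measure[v] * weight(host, v) for v in neighbors)
        return weighted_sum / total_weight

    f1 = measure[host]
    f2 = weighted_mean(inlinks.get(host, []))
    f3 = weighted_mean(outlinks.get(host, []))
    return f1, f2, f3


# Hypothetical toy host graph with hyperlink counts per host pair.
measure = {"a": 0.5, "b": 0.2, "c": 0.8}
link_counts = {("a", "b"): 1, ("a", "c"): 3}
inlinks = {"a": ["b", "c"]}
outlinks = {"a": ["c"]}

# weight(h, v) = n (number of hyperlinks) versus weight(h, v) = 1.
by_count = host_link_features(
    "a", measure, inlinks, outlinks, lambda h, v: link_counts[(h, v)]
)
uniform = host_link_features("a", measure, inlinks, outlinks, lambda h, v: 1)
```

With count weighting, the heavily linked neighbour "c" dominates the inlink average; with uniform weighting both neighbours count equally. Evaluating each measure under each weighting scheme in this way is how the feature count multiplies up.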

2.3 Feature Fusion Strategy 

To analyze the effectiveness of features of different scales,
we train classifiers with the fusion of different features. Fig. 1
shows the flow chart of the fusion strategy with all the
above-mentioned features.

Figure 1: Flow Chart of Web Content Quality Assessment
via Multi-scale Features.

3. WEB CONTENT QUALITY ASSESSMENT 
3.1 Quality Assessment Strategy 

Web content quality assessment is in fact a quality prediction
problem. In the ECML/PKDD 2010 Discovery Challenge,
the quality value is defined based on genre, trust, factuality
and bias. DC2010 assigns the discrete values empirically:
Spam hosts have quality 0; News/Editorial and
Educational sites are worth 5; Discussion hosts are worth 4,
while others are worth 3. DC2010 also gives 2 bonus points
for Facts or Trust, but subtracts 2 for Bias hosts.

In general, we can first classify the Web sites according
to the categories Web Spam, News/Editorial, Commercial,
Educational/Research, Discussion, Personal/Leisure,
Neutrality, Bias and Trustiness, and then compute the
Web content quality with the criteria given by DC2010.
However, state-of-the-art classification methods may be
inappropriate for the ranking:

• Most given classes are imbalanced. Therefore, training 
an effective model is difficult. 

• The predicted probability for every class cannot be 
fully used to rank the Web content quality. 

• The discrete predictions are unfavorable to ranking the 
hosts. 

Since the Web content quality values are discrete,
we first treat Web content quality assessment as
a multi-class classification problem, which overcomes the
aforementioned shortcomings to a great degree. Then, based
on the predicted probabilities of a sample belonging to each
class, the quality values of the Web sites are computed as
follows:



$$Quality(h) = \sum_{i=0}^{N-1} P_i(h) \times Q(i) \tag{5}$$

where $N$ is the number of classes, $Quality(h)$ is the quality
of host $h$, $P_i(h)$ is the predicted probability that host
$h$ belongs to class $i$, with $\sum_{i=0}^{N-1} P_i(h) = 1$, and $Q(i)$ is the
quality value of class $i$. For the ECML/PKDD 2010 Discovery
Challenge quality tasks (Task2 and Task3), $Q(i) = i$ and
$N = 10$, i.e. $i \in \{0, 1, \ldots, 9\}$.

^1 In practice, in order to make things easier, we use the
logarithm of all these values.
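
The quality score of a host is thus the expected class value under the predicted class distribution. A minimal sketch, assuming ten classes with Q(i) = i as in Task2 and Task3; the function name and the probability vector are hypothetical.

```python
def expected_quality(class_probs, class_quality=None):
    """Sum of P_i(h) * Q(i) over all classes i.

    class_probs:   predicted probabilities P_i(h), one per class, summing to 1
    class_quality: quality value Q(i) per class; defaults to Q(i) = i
    """
    if class_quality is None:
        class_quality = list(range(len(class_probs)))
    return sum(p * q for p, q in zip(class_probs, class_quality))


# Hypothetical prediction: the classifier is torn between quality 3 and 5.
probs = [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
score = expected_quality(probs)
```

Here the score is 0.5 * 3 + 0.5 * 5 = 4.0, a value between the two candidate classes. Because the score is continuous, hosts with the same most likely class can still be ordered against each other, which is what the ranking task needs.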

3.2 Learning Algorithms 

For the above-mentioned Web content quality assessment
strategy, the most important step is to effectively predict the
posterior probability of examples belonging to each class. How,
then, can the posterior probability be estimated as accurately
as possible? Fan et al. [17] argue that randomized decision tree
methods effectively approximate the true probability
distribution using the decision tree hypothesis space. When
bagging [7] is applied to C4.5 [12], each random tree is computed
based on a bootstrap sample of the training data, which further
optimizes the posterior probability predictions.
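
The idea of averaging class probabilities over trees trained on bootstrap samples can be sketched as follows. This is only an illustration under loud assumptions: a one-split decision stump stands in for C4.5 (which is not reimplemented here), all function names and the toy data are hypothetical, and 90 models are used merely to mirror the number of bagging iterations reported in the experiments.

```python
import random
from collections import Counter


def train_stump(samples, labels):
    """Fit a one-split tree: pick the (feature, threshold) with fewest errors."""
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({x[f] for x in samples}):
            left = [y for x, y in zip(samples, labels) if x[f] <= threshold]
            right = [y for x, y in zip(samples, labels) if x[f] > threshold]
            errors = sum(
                len(side) - max(Counter(side).values())
                for side in (left, right) if side
            )
            if best is None or errors < best[0]:
                best = (errors, f, threshold, Counter(left), Counter(right))
    _, f, threshold, left_counts, right_counts = best
    return f, threshold, left_counts, right_counts


def stump_proba(stump, x, classes):
    """Class frequencies of the leaf that x falls into."""
    f, threshold, left_counts, right_counts = stump
    counts = left_counts if x[f] <= threshold else right_counts
    total = sum(counts.values())
    if total == 0:
        return [1.0 / len(classes)] * len(classes)  # empty leaf: uniform
    return [counts[c] / total for c in classes]


def bagging_fit(samples, labels, n_models=90, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            train_stump([samples[i] for i in idx], [labels[i] for i in idx])
        )
    return models


def bagging_proba(models, x, classes):
    """Average the per-model class probabilities."""
    sums = [0.0] * len(classes)
    for model in models:
        for j, p in enumerate(stump_proba(model, x, classes)):
            sums[j] += p
    return [s / len(models) for s in sums]


# Hypothetical one-feature data set with two quality classes.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y)
probs = bagging_proba(models, [0.15], classes=[0, 1])
```

Each individual stump outputs a coarse probability, but the average over bootstrap replicates is smoother, which is the property the quality score relies on.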

4. EXPERIMENTS 

4.1 Data Collection 

We evaluate our algorithms on the ECML/PKDD 2010 Discovery
Challenge dataset [2]. In the experiments, we use all
the labeled samples of English hosts as the training
set for Task1 and Task2. DC2010 provides only a few labeled
samples for the French and German tasks, so we put all the
labeled examples, including English, French and German, into the
training set for the multilingual quality tasks (Task3). The
test set used in this paper is the test set of the DC2010
contest.

In our experiments, we assume that the host with www 
and the www-less version have the same quality. After re- 
moving the duplicated samples, we obtain the English train- 
ing set with 2114 samples, French training set with 2400 
samples, and German training set with 2400 samples. 

4.2 Features 

We use all the content-based features and link-based features
provided by DC2010 [2]. For TFIDF features, we select
the 500 dimensions with the top information gain values.
We also ran experiments selecting 1000, 1500 and
2000 TFIDF features and found no obvious difference
in performance. In addition, we extract 50 host level link
features as described in Section 2.2.

4.3 Learning Algorithm and Evaluation 

As described in Section 3, the machine learning algorithm
we use is bagging, with the C4.5 decision tree as the base
classifier. In the experiments, bagging runs 90 iterations
of C4.5.

Normalized Discounted Cumulative Gain (NDCG) [14] is a
measure of the effectiveness of a Web search engine algorithm
or related applications, and is often used in information
retrieval. NDCG is also employed for evaluating the
submissions to the ECML/PKDD 2010 Discovery Challenge [2].
For the detailed evaluation, please refer to the DC2010
evaluation [15].
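
For readers unfamiliar with NDCG, the sketch below shows one common formulation: hosts are ranked by predicted score, each host's true quality is discounted by the logarithm of its rank, and the sum is normalized by the value of the ideal ordering. The linear gain and log2 discount are assumptions on our part; the exact variant used by the DC2010 evaluation [15] may differ, and the function names and toy data are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(predicted_scores, true_quality):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    order = sorted(
        range(len(true_quality)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked = [true_quality[i] for i in order]
    ideal_dcg = dcg(sorted(true_quality, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy data: three hosts with true quality 5, 0 and 3.
true_quality = [5, 0, 3]
perfect = ndcg([0.9, 0.1, 0.5], true_quality)
reversed_rank = ndcg([0.1, 0.9, 0.5], true_quality)
```

A ranking that orders the hosts exactly by true quality scores 1.0, and any misordering near the top of the list lowers the score.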

4.4 Experiment Results 

Table 1 gives the NDCG performance with different
features on Discovery Challenge 2010 Task1. In the header row,
L denotes the page level link related features; H denotes the host
level link features; C denotes the content statistical features; T
denotes the TFIDF features; HCT denotes the fusion of host
level link features, content features and TFIDF features; and
LHCT denotes the fusion of all the above-mentioned features
of different scales. The first column of the table shows the
subtask in Task1. Columns 2 to 7 give the performance
of the quality assessment method with different features on
all the subtasks.

Table 1: Comparisons of Web content quality assessment
performance with different features on Task1 (NDCG)

Task          L       H       C       T       HCT     LHCT
Spam          0.628   0.789   0.784   0.756   0.830*  0.807
News          0.549   0.589   0.625   0.743   0.740   0.748*
Commercial    0.715   0.741   0.753   0.88    0.883*  0.883*
Educational   0.726   0.808   0.805   0.872   0.885*  0.884
Discussion    0.638   0.573   0.768   0.822*  0.784   0.79
Personal      0.594   0.728   0.768   0.804   0.828*  0.827
Neutrality    0.605*  0.511   0.426   0.438   0.465   0.495
Bias          0.525   0.606*  0.518   0.525   0.51    0.549
Trustiness    0.526*  0.506   0.472   0.358   0.485   0.441
Average       0.612   0.65    0.658   0.689   0.712   0.714*

In Table 1, an asterisk marks the best value achieved
for the corresponding subtask. We can see that the fused features
are more effective for most subtasks. According to the
average values, we achieve the best result by fusing all the
features (LHCT), which indicates that the features extracted
from different views can be complementary for the DC2010
classification task.

Table 2 shows the comparison of Web content quality
assessment performance with features of different scales for
the English task. The features used here are the same as those
for Task1.

Table 2: Comparisons of Web content quality assessment
performance with different features on Task2 (NDCG)

Task           L       H       C       T       HCT     LHCT
English Task   0.888   0.914   0.918   0.933   0.936   0.935



In Table 2, we can see that the link features give the least
effective results, and the TFIDF features show the highest
score among the individual feature sets. The fused features,
HCT and LHCT, improve the performance slightly.

Table 3 gives the comparison of Web content quality
assessment performance with features of different scales for
the French and German tasks. Since all the labeled hosts are used
for the multilingual quality task, we employ only the page
level link features and host level link features to avoid the
semantic influence of different languages.



Table 3: Comparisons of Web content quality assessment
performance with different features on Task3 (NDCG)

Task          L       H       LH
German Task   0.792   0.87    0.854
French Task   0.805   0.84    0.833
Average       0.799   0.855   0.844



In Table 3, we can see that the host level link features are
more effective for the cross-linguistic quality tasks. The host
level link features and page level link features we use are not
complementary: fusing them (LH) scores lower than the host
level link features alone.

According to the NDCG performance on all the tasks
described above, we can see that the host level link
features are robust for most tasks. We can also see that
multi-scale feature fusion is necessary for statistical Web
content quality assessment.



5. CONCLUSIONS 

In this paper, we explore multi-scale features that may 
determine the quality of a host and develop automatic sta- 
tistical methods to estimate Web content quality. 

The effectiveness of the multi-scale features is analyzed
on the DC2010 benchmark. The experiments show that the
features from different perspectives have different identification
ability and can complement each other to some degree. For
most tasks, we achieve the best evaluation results with fused
features. The experiments also illustrate the feasibility of
the proposed Web content quality assessment strategy.

Compared with our previous work on Web spam
detection [6], this paper differs in the following respects:
(a) In terms of targets, [6] addresses a detection problem, but
this paper addresses a ranking problem. (b) In terms of
methods, [6] focuses on improving the AUC performance of
binary classification, but this paper relies on the posterior
probabilities of multi-class classification to rank Web content
quality. (c) In terms of features, we use more features
here, for example TFIDF features and DomainPR related
features.

Future work involves: (a) extracting more features, such as
natural language processing features; (b) exploring effective
feature fusion strategies; (c) studying new quality assessment
algorithms, for example by borrowing ideas from RankBoost [18].